SPEAKER ADAPTATION USING WEIGHTED FEEDBACK

Background 

Technical Field of the Invention: The present invention relates to speech 
recognition systems and, more particularly, to speaker adaptation using feedback. 

Background Art: Speech recognition systems using only Speaker Independent 
(SI) models are very sensitive to different speakers due to speaker characteristic 
variations. SI models typically use a Hidden Markov Model (HMM). Speaker
adaptation is a process to adapt an SI model to a speaker dependent (SD) model to
capture the physical characteristics of a given speaker. Speaker adaptation techniques
can be used in supervised or unsupervised mode. In supervised mode, the correct
transcription is known, while in unsupervised mode, no correct transcription is 
available. 

For reliable and robust speaker adaptation, large amounts of adaptation data are
often required in order to cover the linguistic units of a given language. However, for
most practical applications, only a limited amount of adaptation data is available.
Efficient use of the adaptation data becomes extremely important. The traditional
adaptation schemes treat all the adaptation data indiscriminately, which results in some
parts of the adaptation data being relatively under-trained or under-weighted. Usually
the under-represented words are less likely to be recognized by the decoder.

The traditional adaptation scheme is as follows:

1. Given some adaptation enrollment data and an SI model, collect statistics
on the enrollment data and perform speaker adaptation on the SI model.

2. Decode the test utterances with the adapted acoustic model.

Such a scheme uses the enrollment data only once and does not incorporate any
feedback from decoding. It is fast in practice, but does not always provide good
performance.

Approaches to speaker adaptation include those described in J. L. Gauvain et al.,
"Maximum a posteriori estimation for multivariate Gaussian mixture observations of
Markov chains," IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 291-298;
L.R. Bahl et al., "A New Algorithm for the Estimation of Hidden Markov Model
Parameters," IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. 493-496, 1988; and C.J. Leggetter et al., "Maximum likelihood linear
regression for speaker adaptation of continuous density HMMs," Computer Speech and
Language, Vol. 9, pp. 171-185, 1995. In some of these approaches, errors included in
recognizing a particular speaker's utterances are not considered. In a "corrective
training" approach, such as in the above-recited L. R. Bahl et al. article, an error in
recognition of the utterance may be considered, but a very complicated technique is
used to compensate for it. Background articles on expectation maximization (EM) and
maximum likelihood (ML) are provided in the articles A.P. Dempster et al.,
"Maximum likelihood from incomplete data via the EM algorithm," Journal of the
Royal Statistical Society, Series B 39, pp. 1-38, 1977; and N. Laird, "The EM
algorithm," Handbook of Statistics, vol. 9, Elsevier Science Publishers B.V., 1993.

An iterative technique in speech recognition is to recognize utterances based on
an SI model and to create an SD model therefrom and then to apply the SD model to
recognizing the utterances to create a more refined SD model and so forth.

There is a need for improved techniques for speaker adaptation. Such improved 
techniques are described in this disclosure. 

Brief Description of the Drawings 
The invention will be understood more fully from the detailed description given
below and from the accompanying drawings of embodiments of the invention which,
however, should not be taken to limit the invention to the specific embodiments
described, but are for explanation and understanding only.

FIG. 1 is a partial flow and partial block diagram representation of some
embodiments of the invention.

FIG. 2 illustrates a segment (e.g., a phone) of the utterances which includes 
multiple frames. 

FIG. 3 illustrates a section (e.g., word) of the utterances which includes multiple
segments (e.g., phones).

FIG. 4 is a partial flow and partial block diagram representation similar to a
portion of FIG. 1, but may allow multiple feedback passes.

FIG. 5 is a high level schematic block diagram representation of a computer
system that may be used in connection with some embodiments of the invention.

FIG. 6 is a high level schematic representation of a hand-held computer system 
that may be used in connection with some embodiments of the invention. 


Detailed Description 
The present invention involves speaker adaptation whereby characteristics of an
SI model can be adapted through consideration of adaptation enrollment data from a
particular speaker to create an SD model. More particularly, the adaptation enrollment
data is weighted according to errors detected in the recognized utterances. For those
words (or utterances in the enrollment data set) that are not well learned by speaker
adaptation, as indicated by misrecognizing those words, the invention provides a way to
incorporate the decoding feedback so that these words can be better adapted. When
only limited amounts of enrollment data are available, this scheme of iterative
bootstrapping makes better use of that limited data. The scheme can be extended to
unsupervised adaptation where references may contain errors. In some embodiments,
an iterative adaptation scheme dynamically adjusts enrollment data to incorporate
feedback from decoding on the enrollment data.

In the following disclosure, the term "some embodiments" or "other
embodiments" means that a particular feature, structure, or characteristic described in
connection with the embodiments is included in at least some embodiments, but not
necessarily all embodiments, of the invention. The various appearances of "some
embodiments" are not necessarily all referring to the same embodiments.

In the following disclosure, when the term phone is used, it could include all phonemes
in a particular language or less than all the phonemes. To reduce complexity, some
speech recognition systems do not recognize every phoneme in a particular language.

The following four parts are used in some embodiments of the invention. A
fifth part is used in still other embodiments.

1. Denote M as the initial SI (speaker independent) model and A as the
enrollment data set.

2. Perform speech recognition on data set A based on model M.

3. Adjust A to A* according to the decoding results from part 2,
emphasizing or de-emphasizing certain parts of A with weights based on these results.
The emphasizing/de-emphasizing is achieved by assigning a weight to each word in the
adaptation data. How to calculate the weight will be discussed below.

4. Adapt model M to M' using enrollment data A*.

5. (optional) Repeat parts 3 and 4 with the updated model.
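The parts listed above can be read as a simple loop. The following Python sketch is only an illustration under stated assumptions: the function name adapt_with_feedback, the callables recognize, weigh, and adapt, and the toy "model" (a set of known words) are all hypothetical and are not taken from this disclosure. The sketch also assumes that each pass re-decodes the enrollment data with the most recently adapted model.

```python
from typing import Callable, List, Sequence


def adapt_with_feedback(
    model,
    utterances: Sequence,
    references: Sequence,
    recognize: Callable,
    weigh: Callable,
    adapt: Callable,
    passes: int = 1,
):
    """Run parts 2-4 once per pass; passes > 1 corresponds to optional part 5."""
    for _ in range(passes):
        # Part 2: decode the enrollment set A with the current model.
        hypotheses = [recognize(model, u) for u in utterances]
        # Part 3: turn decoding feedback into per-item weights (A -> A*).
        weights = weigh(hypotheses, references)
        # Part 4: adapt the model with the weighted enrollment data (M -> M').
        model = adapt(model, utterances, references, weights)
    return model


# Toy stand-ins (purely illustrative): the "model" is the set of words it knows.
def toy_recognize(model, utterance):
    return utterance if utterance in model else "<unk>"


def toy_weigh(hypotheses, references) -> List[float]:
    # Weight 1.0 for a misrecognized item, 0.0 otherwise.
    return [0.0 if h == r else 1.0 for h, r in zip(hypotheses, references)]


def toy_adapt(model, utterances, references, weights):
    # "Learn" every reference word that carried a positive weight.
    learned = {r for r, w in zip(references, weights) if w > 0.0}
    return frozenset(model) | learned


adapted = adapt_with_feedback(
    frozenset({"yes"}), ["yes", "no"], ["yes", "no"],
    toy_recognize, toy_weigh, toy_adapt,
)
print(sorted(adapted))
```

In a real system the three callables would wrap a decoder, a weight calculation, and an adaptation routine; the loop structure is the only point of the sketch.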

For example, FIG. 1 represents some embodiments of the invention in a
diagram which is partially a flow diagram and partially a block diagram. A dashed line
represents a dividing line between acts occurring during an adaptation mode and a
recognition (decoding) mode. The recognition phase occurs after the SD model is
created in the adaptation phase. Note that microphone 14, processing block 18, and
recognition block 22 are shown above and below the dashed line and may represent the
same blocks at different times (before and after the conclusion of adaptation). In this
disclosure, a block may be hardware or a combination of hardware and software.

Referring to FIG. 1 above the dashed line, a speaker input such as microphone
14 receives utterances of a particular speaker. The utterances are converted to digital
signals U and may be otherwise processed according to well known techniques by
processing block 18. Note that microphone 14 may be adjacent to the computer system
that performs the acts illustrated in FIG. 1 or the microphone may be remote from it. For
example, microphone 14 may be in a telephone or other remote system. Processing
block 18 provides the processed utterances U to a recognition block 22 and a weighting
block 30. Utterances U may be stored in a wave file as a collection of utterances. Of
course, there may be spaces of silence or lack of speech between the sections of the
utterances.

Recognition block 22 produces a recognized (hypothesized) phone string H
based on the utterances U and an SI model. In a comparison and weight calculating
block 26, recognized phone string H is compared with a reference (true) phone string R.
The reference phone string is what the speaker is requested to read. A word-phone
dictionary may be used to convert the reference word string into phones. Of course,
there may be silences or lack of speech in the recognized and reference phone strings.
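As a minimal sketch of the word-to-phone conversion just described, the following Python function expands a reference word string through a lookup table. The lexicon entries, the phone symbols, the "sil" marker, and the choice to place a silence marker around every word are illustrative assumptions, not details given in this disclosure.

```python
from typing import Dict, List, Sequence


def words_to_phones(
    words: Sequence[str],
    lexicon: Dict[str, List[str]],
    silence: str = "sil",
) -> List[str]:
    """Expand a reference word string into a reference phone string."""
    phones = [silence]  # assumed leading silence
    for word in words:
        phones.extend(lexicon[word])  # raises KeyError for out-of-lexicon words
        phones.append(silence)  # assumed silence after each word
    return phones


# Hypothetical two-word lexicon, for illustration only.
toy_lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
reference_phones = words_to_phones(["yes", "no"], toy_lexicon)
print(reference_phones)
```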

Differences between the recognized and reference phone strings can be
determined in a variety of ways. In some embodiments, speech features in the
recognized and reference phone strings are compared on a frame-by-frame level.
Merely as an example, the frames may be on the order of about 25 milliseconds (ms),
although various other frame durations could be used. A phone may be around 200
milliseconds, although various other phone durations are possible. Accordingly, in
some embodiments, there may be on the order of 10 frames per phone. A forced
alignment algorithm may be used to mark the time or place each phone (or word)
happens in the utterances. The frames may contain a Gaussian feature vector.

For example, referring to FIG. 2, portions of the reference string and recognized
string for a series of frames are illustrated. The frames are arbitrarily labeled F1, F2,
F3, FX-2, FX-1, and FX, wherein there may be several frames between frames F3
and FX-2. The portions in a frame may be a feature extraction. Each portion (e.g.,
feature extraction) has characteristics (e.g., Gaussian), which are labeled "C." The
particular number after the "C" is arbitrarily chosen. For example, in frame F1, both
the reference and recognized strings have characteristics C4. Accordingly, the
comparison indicates that the characteristics of the reference and recognized strings are
the same (S) for frame F1. In frame F2, the characteristic of the reference string is C15
and the characteristic of the recognized string is C11. Accordingly, the comparison
indicates that the characteristics of the reference and recognized strings are different
(D) for frame F2. (Merely as an example, S could be "0" and D could be "1," or
various other schemes could be used.) Likewise, in frames F3, FX-2, and FX-1, the
characteristics are the same and in frame FX, the characteristics are different.
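The frame-by-frame comparison can be sketched in a few lines of Python, using the example encoding of S as 0 and D as 1. The function name compare_frames is hypothetical, and the label "C7" for the third frame is made up for illustration; only C4, C15, and C11 come from the example above.

```python
from typing import List, Sequence


def compare_frames(reference: Sequence[str], recognized: Sequence[str]) -> List[int]:
    """Return 0 (same, S) or 1 (different, D) for each aligned frame."""
    if len(reference) != len(recognized):
        raise ValueError("strings must be aligned to the same number of frames")
    return [0 if r == h else 1 for r, h in zip(reference, recognized)]


# Frames F1, F2, F3: C4/C4 (same), C15/C11 (different), C7/C7 (same, made-up label).
flags = compare_frames(["C4", "C15", "C7"], ["C4", "C11", "C7"])
print(flags)
```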

In some embodiments, a certain number of frames forms a segment. The
segment may be a phone or other portion of the utterance. Referring to FIG. 2, as an
example, a segment 1 may be formed of frames F1, F2, F3, . . ., FX-2, FX-1, FX. As
illustrated in FIG. 3, a section of the utterances may be formed of multiple segments. In
some embodiments, the section is a word, although the invention is not so limited.
Segments of silence or lack of speech can be used to indicate the boundary of a word.

If a word includes a phone having an error (the characteristics of a frame of the
reference and hypothesis in the word are different (see FIG. 2)), then the word is
considered an error word, and the weight of the word is calculated.
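The error-word rule above can be illustrated with a short Python sketch. It assumes, hypothetically, that word boundaries are available as (word, begin frame, end frame) triples with an exclusive end index, and that per-frame flags use 1 for "different" and 0 for "same"; neither representation is specified in this disclosure.

```python
from typing import List, Sequence, Tuple


def find_error_words(
    word_spans: Sequence[Tuple[str, int, int]],
    frame_flags: Sequence[int],
) -> List[str]:
    """A word is an error word if any frame inside its span differs."""
    return [
        word
        for word, begin, end in word_spans
        if any(frame_flags[begin:end])
    ]


# Two hypothetical words covering frames 0-2 and 3-4; only frame 1 differs.
spans = [("yes", 0, 3), ("no", 3, 5)]
error_words = find_error_words(spans, [0, 1, 0, 0, 0])
print(error_words)
```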

The weights assigned to the sections of utterances U may be calculated in block
26 through various techniques. The following are some examples, although the
invention is not limited to the examples.

In some embodiments, the weight value for each word is estimated from the
likelihood information of the references (the true input word string) and hypotheses (the
word string decoded by the recognizer, which may contain errors).

1. Run a forced alignment program on the reference stream to get statistics
of the references.

2. Decode the utterance to get statistics of the 1-best hypothesis.

3. Align the 1-best hypothesis with the reference sentence to obtain the
error words.

4. Calculate the average likelihood difference per frame according to
equation (1) as follows:

$$L_n = \frac{H_L^n}{H_e^n - H_b^n} - \frac{R_L^n}{R_e^n - R_b^n} \qquad (1)$$

where $H_L^n$ is the log likelihood of hypothesis word n, $H_b^n$ is the beginning frame
index (in time), and $H_e^n$ is the end frame index. $R_L^n$, $R_b^n$, and $R_e^n$ are the reference
counterparts. Of course, the invention is not limited to the details of equation (1).

Note that equation (1) involves likelihoods, which are not necessarily probabilities.
Equation (1) could be modified to involve probabilities.
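As a worked illustration, the following Python sketch computes the average likelihood difference per frame for one word, assuming equation (1) takes the form of the hypothesis log likelihood divided by its frame count minus the reference log likelihood divided by its frame count. The function and parameter names are hypothetical, and the numbers are made up.

```python
def avg_likelihood_diff(
    h_loglik: float, h_begin: int, h_end: int,
    r_loglik: float, r_begin: int, r_end: int,
) -> float:
    """Per-frame log likelihood of the hypothesis minus that of the reference."""
    h_frames = h_end - h_begin
    r_frames = r_end - r_begin
    if h_frames <= 0 or r_frames <= 0:
        raise ValueError("end frame index must be greater than begin frame index")
    return h_loglik / h_frames - r_loglik / r_frames


# Hypothesis: log likelihood -50 over 10 frames; reference: -66 over 11 frames.
diff = avg_likelihood_diff(-50.0, 0, 10, -66.0, 0, 11)
print(diff)  # (-5.0) - (-6.0) = 1.0
```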

Next, the weight value $W_i$ for misrecognized words of a particular speaker "i" is
obtained by averaging $|L_n|$ over all the misrecognized words (error words) according to
equation (2) as follows:

$$W_i = \frac{1}{m} \sum_{n=1}^{m} |L_n| \qquad (2)$$

wherein m may be the number of misrecognized words. Of course, the invention is not
limited to the details of equation (2). In equation (2), the sections are for words (e.g.,
"W" refers to words), but the sections could be something else. (See FIG. 3.)

Note that in embodiments using equation (2), each misrecognized word may
have the same averaged weight. Alternatively, different words could have different
weights through applying the result of equation (1) more directly.
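Both options can be sketched in Python, assuming equation (2) is the mean of the absolute per-frame likelihood differences over the m error words. The function names, the example words, and the choice of using the absolute value of each word's own difference as its individual weight are illustrative assumptions.

```python
from typing import Dict, Sequence


def speaker_weight(diffs: Sequence[float]) -> float:
    """Shared weight: mean absolute likelihood difference over the error words."""
    if not diffs:
        return 0.0  # assumption: no error words means no extra weight
    return sum(abs(d) for d in diffs) / len(diffs)


def per_word_weights(diffs_by_word: Dict[str, float]) -> Dict[str, float]:
    """Alternative: each error word keeps its own absolute difference as its weight."""
    return {word: abs(d) for word, d in diffs_by_word.items()}


# Hypothetical per-word results of equation (1) for two error words.
error_diffs = {"no": 1.0, "stop": -3.0}
shared = speaker_weight(list(error_diffs.values()))
individual = per_word_weights(error_diffs)
print(shared, individual)
```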

Once the weights are calculated, the weights and places of error are provided to
block 30. The reference string (or at least the portions of the reference string
corresponding to the errors in the recognized stream) is communicated to block 30. In
block 30, the utterance U is marked with the errors and corresponding weights are
noted. The adaptation enrollment data (E) includes the marked utterances with
corresponding weights (wU) and at least those portions of the reference stream (R*)
that correspond to the errors in the utterance. The SI model and SD model may be
Gaussian mixtures. The wave file U may be transformed (e.g., through an FFT) from
the time domain to the frequency domain. The weight w may be expressed as a floating
point number.

In adaptation box 34, the adaptation enrollment data is applied with the SI
model to create the SD model according to known techniques, except that the
enrollment data may have additional weights. In some embodiments, in the adaptation
box 34, the error words are added w times to the SI model. In some embodiments,
these weights are added to those of the SI model, although the invention is not limited
to this. More complicated schemes could be used, but are not required.
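One way to picture how extra weights could enter an adaptation step is a MAP-style update of a single Gaussian mean, where each frame contributes in proportion to its weight. This is a swapped-in illustration, not the adaptation method of this disclosure: the function name, the scalar features, the prior count tau, and the convention that frames from error words carry a larger weight are all assumptions.

```python
from typing import Sequence


def weighted_map_mean(
    prior_mean: float,
    frames: Sequence[float],
    frame_weights: Sequence[float],
    tau: float = 10.0,
) -> float:
    """MAP-style mean: the SI prior counts as tau frames, each frame counts by its weight."""
    if len(frames) != len(frame_weights):
        raise ValueError("frames and frame_weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(frame_weights, frames))
    total_weight = sum(frame_weights)
    return (tau * prior_mean + weighted_sum) / (tau + total_weight)


# Two frames with value 1.0; the second comes from an error word with weight 3.
plain = weighted_map_mean(0.0, [1.0, 1.0], [1.0, 1.0], tau=4.0)
boosted = weighted_map_mean(0.0, [1.0, 1.0], [1.0, 3.0], tau=4.0)
print(plain, boosted)
```

In this sketch the boosted frame pulls the mean further toward the speaker's data, while the prior count tau limits how far a small enrollment set can move the SI mean.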

It is important to not give too much weight to the enrollment data, because they 
are based on limited sampling. 

In the above described embodiments, weights are only calculated for words for
which there is an error in recognition. Alternatively, there could be weights (e.g.,
negative weights) for correctly recognized words. Note that in different embodiments
the weights can be positive or negative depending on the scheme chosen.

Once the SD model is calculated in an adaptation mode, it is applied on path 40
for use by block 22 in a recognition mode, below the dashed line.

FIG. 4 illustrates that the feedback can be performed more than once until
differences between H and R are less than a threshold (see decision block 36). (It could
be less than or equal to a threshold.) To determine whether the differences between H
and R are less than a threshold, the various errors can be summed and then compared to
a single threshold or different errors can be compared to separate thresholds. Other
approaches could be used.

Note that in FIG. 4, the input to recognition block 22 changes with each pass.
The utterances may be stored for re-use. The inputs to the adaptation block 34 also
change. However, if the difference between H and R is less than a threshold, then the
previous enrollment data is the one applied to path 40 for use during recognition mode
(decoding).
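The two threshold tests described for decision block 36 can be sketched as follows. The error categories ("substitution", "deletion") and the function names are hypothetical; this disclosure does not name specific error types.

```python
from typing import Dict


def below_single_threshold(error_counts: Dict[str, int], threshold: int) -> bool:
    """Sum all errors and compare the total to one threshold."""
    return sum(error_counts.values()) < threshold


def below_separate_thresholds(
    error_counts: Dict[str, int], thresholds: Dict[str, int]
) -> bool:
    """Compare each kind of error to its own threshold; all must pass."""
    return all(
        error_counts.get(kind, 0) < limit for kind, limit in thresholds.items()
    )


counts = {"substitution": 2, "deletion": 1}
stop_single = below_single_threshold(counts, threshold=4)
stop_separate = below_separate_thresholds(counts, {"substitution": 3, "deletion": 1})
print(stop_single, stop_separate)
```

A "less than or equal to" variant would only change the comparison operator.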

There are a variety of computer systems that may be used in training and using a
speech recognition system. Merely as an example, FIG. 5 illustrates a highly schematic
representation of a computer system 100 which includes a processor 114, memory 116,
and input/output and control block 118. There may be a substantial amount of
memory in processor 114, and memory 116 may represent either memory that is off the
chip of processor 114 or memory that is partially on and partially off the chip of
processor 114. (Or memory 116 could be completely on the chip of processor 114.) At
least some of the input/output and control block 118 could be on the same chip as
processor 114, or be on a separate chip. A microphone 126, monitor 130, additional
memory 134, input devices (such as a keyboard and mouse 138), a network
connection 142, and speaker(s) 144 may interface with input/output and control block
118. Memory 134 represents a variety of memory such as a hard drive and CD ROM or
DVD discs. It is emphasized that the system of FIG. 5 is merely exemplary and the
invention is not limited to use with such a computer system. Computer system 100 and
other computer systems used to carry out the invention may be in a variety of forms,
such as desktop, mainframe, and portable computers.

For example, FIG. 6 illustrates a handheld device 160, with a display 162, which
may incorporate some or all the features of FIG. 5. The handheld device may at times
interface with another computer system, such as that of FIG. 5. The shapes and relative
sizes of the objects in FIGS. 5 and 6 are not intended to suggest actual shapes and
relative sizes.

Various memories mentioned above (e.g., CD-ROM, flash memory, hard-drive)
include computer readable storage media on which instructions may be stored
which when executed cause some embodiments of the invention to occur.

If this disclosure states a component, feature, structure, or characteristic "may",
"might", or "could" be included, that particular component, feature, structure, or
characteristic is not required to be included. If the specification or claims refer to "a" or
"an" element, that does not mean there is only one of the element. If the specification
or claims refer to "an additional" element, that does not preclude there being more than
one of the additional element.

Those skilled in the art having the benefit of this disclosure will appreciate that
many other variations from the foregoing description and drawings may be made within
the scope of the present invention. Accordingly, it is the following claims including
any amendments thereto that define the scope of the invention.



WO 02/01549

PCT/CN00/00158



CLAIMS 

What is claimed is: 

1. A method comprising:

(a) calculating estimated weights for identified errors in recognition of
utterances;

(b) marking sections of the utterances as being misrecognized and associating
the corresponding estimated weights with these sections of the utterances; and

(c) using the weighted sections of the utterances to convert a speaker
independent model to a speaker dependent model.

2. The method of claim 1, wherein parts (a) - (c) are repeated at least once.

3. The method of claim 1, wherein the utterances are converted into a
recognized phone string a first time through applying the speaker independent model
and thereafter through applying the most recently obtained speaker dependent model.

4. The method of claim 1, wherein the estimated weights are computed
through computing an average likelihood difference per frame and then computing a
weight value by averaging the average likelihood difference over all the error words.

5. The method of claim 1, wherein average likelihood difference per frame
is used to calculate the estimated weights and is computed according to the equation (1)
as follows:

$$L_n = \frac{H_L^n}{H_e^n - H_b^n} - \frac{R_L^n}{R_e^n - R_b^n} \qquad (1),$$

where $H_L^n$ is the log likelihood of hypothesis word n, $H_b^n$ is the beginning frame
index (in time), and $H_e^n$ is the end frame index, and $R_L^n$, $R_b^n$, and $R_e^n$ are counterparts
for a reference string.

6. The method of claim 5, wherein the weight for misrecognized words of a
particular speaker "i" is calculated according to equation (2) as follows:

$$W_i = \frac{1}{m} \sum_{n=1}^{m} |L_n| \qquad (2),$$

wherein m is a number of misrecognized words.

7. The method of claim 1, wherein for a particular speaker, different
misrecognized words may have a different weight.

8. A method comprising: 

(a) recognizing utterances through converting the utterances into a recognized 
phone string; 

(b) comparing the recognized phone string with a reference phone string; 

(c) calculating estimated weights for sections of the utterances; 

(d) marking errors in the utterances and providing corresponding estimated 
weights to form adaptation enrollment data; and 

(e) using the adaptation enrollment data to convert a speaker independent model 
to a speaker dependent model. 

9. The method of claim 8, wherein the utterances are converted into the 
recognized phone string through applying the speaker independent model. 

10. The method of claim 8, wherein parts (b) — (e) are repeated until 
differences between the reference and recognized strings are less than a threshold. 

11. The method of claim 8, wherein the utterances are converted into a 
recognized phone string a first time through applying the speaker independent model 
and thereafter through applying the most recently obtained speaker dependent model. 

12. The method of claim 8, wherein the estimated weights are computed
through computing an average likelihood difference per frame and then computing a
weight value by averaging the average likelihood difference over all the error words.

13. The method of claim 8, wherein an average likelihood difference per
frame is used to calculate the estimated weights and is calculated according to the
equation (1) as follows:

$$L_n = \frac{H_L^n}{H_e^n - H_b^n} - \frac{R_L^n}{R_e^n - R_b^n} \qquad (1),$$

where $H_L^n$ is the log likelihood of hypothesis word n, $H_b^n$ is the beginning frame
index (in time), and $H_e^n$ is the end frame index, and $R_L^n$, $R_b^n$, and $R_e^n$ are counterparts
for the reference string.

14. The method of claim 13, wherein the weight for misrecognized words of
a particular speaker "i" is calculated according to equation (2) as follows:

$$W_i = \frac{1}{m} \sum_{n=1}^{m} |L_n| \qquad (2),$$

wherein m is a number of misrecognized words.

15. The method of claim 8, wherein for a particular speaker, different 
misrecognized words may have a different weight. 

16. A memory comprising: 

a storage medium having instructions thereon which when executed cause a 
computer system to perform the following method: 

(a) calculating estimated weights for identified errors in recognition of 
utterances; 

(b) marking sections of the utterances as being misrecognized and associating 
the corresponding estimated weights with these sections of the utterances; and 

(c) using the weighted sections of the utterances to convert a speaker 
independent model to a speaker dependent model. 

17. The method of claim 16, wherein parts (a) - (c) are repeated at least
once.

18. The method of claim 16, wherein the utterances are converted into a 
recognized phone string a first time through applying the speaker independent model 
and thereafter through applying the most recently obtained speaker dependent model. 

19. The method of claim 16, wherein the estimated weights are computed
through computing an average likelihood difference per frame and then computing a
weight value by averaging the average likelihood difference over all the error words.

20. The method of claim 16, wherein average likelihood difference per
frame is used to calculate the estimated weights and is computed according to the
equation (1) as follows:

$$L_n = \frac{H_L^n}{H_e^n - H_b^n} - \frac{R_L^n}{R_e^n - R_b^n} \qquad (1),$$

where $H_L^n$ is the log likelihood of hypothesis word n, $H_b^n$ is the beginning frame
index (in time), and $H_e^n$ is the end frame index, and $R_L^n$, $R_b^n$, and $R_e^n$ are counterparts
for a reference string.

21. The method of claim 20, wherein the weight for misrecognized words of
a particular speaker "i" is calculated according to equation (2) as follows:

$$W_i = \frac{1}{m} \sum_{n=1}^{m} |L_n| \qquad (2),$$

wherein m is a number of misrecognized words.

22. The method of claim 16, wherein for a particular speaker, different
misrecognized words may have a different weight.

23. A memory comprising: 

a storage medium having instructions thereon which when executed cause a 
computer system to perform the following method: 

(a) recognizing utterances through converting the utterances into a recognized
phone string;

(b) comparing the recognized phone string with a reference phone string; 

(c) calculating estimated weights for sections of the utterances; 

(d) marking errors in the utterances and providing corresponding estimated 
weights to form adaptation enrollment data; and 

(e) using the adaptation enrollment data to convert a speaker independent model 
to a speaker dependent model. 

24. The method of claim 23, wherein the utterances are converted into the 
recognized phone string through applying the speaker independent model. 

25. The method of claim 23, wherein parts (b) - (e) are repeated until 
differences between the reference and recognized strings are less than a threshold. 

26. The method of claim 23, wherein the utterances are converted into a 
recognized phone string a first time through applying the speaker independent model 
and thereafter through applying the most recently obtained speaker dependent model. 

27. The method of claim 23, wherein the estimated weights are computed
through computing an average likelihood difference per frame and then computing a
weight value by averaging the average likelihood difference over all the error words.

28. The method of claim 23, wherein an average likelihood difference per
frame is used to calculate the estimated weights and is calculated according to the
equation (1) as follows:

$$L_n = \frac{H_L^n}{H_e^n - H_b^n} - \frac{R_L^n}{R_e^n - R_b^n} \qquad (1),$$

where $H_L^n$ is the log likelihood of hypothesis word n, $H_b^n$ is the beginning frame
index (in time), and $H_e^n$ is the end frame index, and $R_L^n$, $R_b^n$, and $R_e^n$ are counterparts
for the reference string.

29. The method of claim 28, wherein the weight for mdsrecognized words of 
a particular speaker "i" is calculated according to equation (2) as follows: 



1 m 



m (2). 



wherein m a number of misrecognized words, 

30. The method of claim 23, wherein for a particular speaker, different 
misrecognized words may have a different weight. 